Skip to content

fix npu grpo#55

Merged
addsubmuldiv merged 2 commits intodevfrom
npu_grpo_fix2
Feb 12, 2026
Merged

fix npu grpo#55
addsubmuldiv merged 2 commits intodevfrom
npu_grpo_fix2

Conversation

@addsubmuldiv
Copy link
Copy Markdown
Collaborator

No description provided.

Copilot AI review requested due to automatic review settings February 11, 2026 15:57
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello @addsubmuldiv, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on enhancing the stability and compatibility of the system, particularly for NPU (Neural Processing Unit) environments. It resolves critical issues related to distributed training process group initialization by preventing port collisions and improves vLLM integration on NPU by ensuring stable device identification and avoiding incompatible model patching. Additionally, it refines dataset loading to gracefully handle DatasetDict structures, making the data pipeline more robust.

Highlights

  • NPU Process Group Initialization Fixes: Implemented a mechanism to prevent HCCL port collisions in multi-job NPU environments by deriving deterministic port ranges from the MASTER_PORT environment variable. This ensures stable process group initialization for NPU training.
  • vLLM NPU Compatibility Enhancements: Addressed several vLLM-related issues on NPU platforms, including stable device UUID resolution for IPC communication, conditional skipping of monkey_patch_model to avoid Triton compatibility errors, and correct device binding for vLLM workers using local_rank.
  • DatasetDict Handling: Improved dataset loading logic to correctly handle cases where load_dataset returns a DatasetDict instead of a Dataset, ensuring downstream operations like select and map function as expected. This includes robust split resolution and early error reporting for missing splits.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • src/twinkle/dataset/base.py
    • Imported DatasetDict from datasets.
    • Added logic to normalize dataset objects, converting DatasetDict to a single Dataset by selecting the specified split or 'train' split, or raising a KeyError if no suitable split is found.
  • src/twinkle/model/base.py
    • Imported the os module.
    • Introduced a call to _ensure_hccl_socket_env within _try_init_process_group for the 'hccl' backend to set deterministic HCCL socket environment variables, mitigating port collision issues in multi-job NPU runs.
  • src/twinkle/sampler/vllm_sampler/vllm_engine.py
    • Imported get_vllm_device_uuid from twinkle.utils.platform.
    • Replaced the direct call to current_platform.get_device_uuid(0) with get_vllm_device_uuid(0) for generating the ZMQ handle, providing NPU-specific fallback for device UUID resolution.
  • src/twinkle/sampler/vllm_sampler/vllm_sampler.py
    • Added a conditional check to skip self.engine.engine.collective_rpc("monkey_patch_model") if the platform's device prefix is 'NPU', preventing Triton compatibility errors during sampler initialization.
  • src/twinkle/sampler/vllm_sampler/vllm_worker_extension.py
    • Imported get_vllm_device_uuid from twinkle.utils.platform.
    • Removed the internal _get_device_uuid function.
    • Modified update_weights_from_ipc to pass getattr(self, 'local_rank', None) to Torch.get_device() when initializing the device, ensuring correct device binding for workers.
    • Modified load_synced_weights to also pass getattr(self, 'local_rank', None) to Torch.get_device() for consistent device resolution.
    • Updated _get_zmq_handle to use get_vllm_device_uuid(self.device.index) for generating the device UUID, ensuring stable IPC socket names.
  • src/twinkle/utils/network.py
    • Imported the os module.
    • Defined new environment variable constants for HCCL socket port ranges.
    • Added _derive_hccl_socket_env_defaults function to compute deterministic HCCL port ranges based on a given master_port.
    • Implemented _ensure_hccl_socket_env to set these HCCL environment variables if they are not already defined, preventing port collisions.
    • Called _ensure_hccl_socket_env(master_port) within stateless_init_process_group for the 'hccl' backend to apply the port collision fix.
  • src/twinkle/utils/platform.py
    • Imported hashlib, re, and socket modules.
    • Added _resolve_ascend_physical_device_id to map local NPU device indices to physical IDs using ASCEND_RT_VISIBLE_DEVICES.
    • Implemented _get_npu_bus_id_from_npu_smi to extract NPU Bus-Id from npu-smi info output, providing a stable identifier.
    • Created get_vllm_device_uuid which attempts to use current_platform.get_device_uuid, falls back to NPU Bus-Id if NotImplementedError is raised, and finally resorts to a deterministic SHA1 hash of hostname, visible devices, and device ID if npu-smi is unavailable.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several fixes and improvements for running on NPU platforms. Key changes include handling DatasetDict returns for robustness, preventing HCCL port collisions in multi-job NPU environments, and providing a fallback for get_device_uuid on NPUs to ensure stable IPC communication. The changes are well-reasoned and improve platform compatibility. I've added a couple of minor suggestions to improve code clarity and remove leftover debugging code.

self.device = torch.device(Torch.get_device())
# fix: In some worker paths, omitting local_rank can pick the wrong device / trigger get_device arg issues.
# fix: Pass local_rank when available so each worker binds to the expected local device.
print(f"VLLM Worker local_rank: {getattr(self, 'local_rank', None)} <<<<<<<<<<<<< {Torch.get_device()}")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This print statement appears to be for debugging purposes. It should be removed from the production code to avoid cluttering the logs.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@addsubmuldiv addsubmuldiv merged commit 6715192 into dev Feb 12, 2026
0 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants